Abstract - Amino acid substitution matrices, which show the
similarity scores between pairs of amino acids, have been widely
used in protein sequence alignments. These matrices are based
on the Dayhoff model of evolutionary substitution rates. Using
machine learning techniques, we obtained three-dimensional
representations of these matrices while preserving most of the
information contained in the matrices. Vector representation of
amino acids has many applications in pattern recognition.
Keywords - substitution matrices, machine learning, distance
mapping.

I.  Introduction 

Protein similarity score matrices are constructed from
substitution matrices, which are obtained from multiple
alignments of several evolutionarily related sequences [1-5].
The substitution matrices are derived from evolutionary
amino acid substitution frequencies of protein sequences.
Using information theory, these frequencies are converted
into similarity scores. These scores are correlated with the
physical and chemical properties of amino acids. These
matrices are used in protein sequence comparison, in generating
sequence profiles, and in database searches for similar
sequences. They are also used in sequence and structural
pattern recognition problems such as secondary structure
prediction and finding contact maps of proteins.

The aim of this work is to simplify the representation of
the twenty amino acids in a metric space with minimum loss
of information. What is required is a mapping, y = f(x),
where the input value is the similarity score from a
given matrix and the output value is a multidimensional
vector for each amino acid.

Since  only  the  x  values  are  known,  the  problem  can  be 
viewed  as  an  unsupervised  learning  problem.  Machine 
learning  techniques  offer  several  alternatives  to  resolve  the 
problem.  The  typical  approach  is  to  refine  iteratively  the 
representation  of  symbols  into  multi-dimensional  space  by 
minimising  the  error  for  the  obtained  vectors  that  correspond 
to  the  given  similarity  scores. 

For this problem, artificial neural networks [6,7] may be
practical, since neural network learning methods provide a
robust approach to approximating real-valued and
vector-valued functions, which is exactly the case in this
problem. Their tolerance of errors in the training examples
is another advantage, because the score matrices are not
guaranteed to be reducible to a space of a given dimension.

On the other hand, hybrid techniques that combine the
strengths of exact mathematical methods and machine learning
techniques are also possible. The score matrix can be put
into a form that more readily reflects an N-dimensional space;
this new representation can then be used to obtain the
reduced number of dimensions through a refinement cycle that
converges to the target function.


Instead of going from the similarity matrix into a metric
space and then into a vector space, we can go directly from
similarity score matrices to a multidimensional space using
nonlinear mapping techniques. This way we eliminate the
risk of losing information during the similarity-to-distance
transformation stage.

II.  METHODOLOGY 

The linear mapping stage takes the similarity values as input
and uses a transformation to map the amino acids onto the
metric space, relying on the intuitive notion that distance and
similarity are opposite concepts. The inverse of similarity
should therefore define distance: the higher the similarity
score, the closer the symbols are expected to be positioned
in space.

To transform similarity scores to distance values, several
formulas were tried, and those that performed best were
chosen. Performance was judged by the ratio of the sum of the
greatest N eigenvalues to the sum of all 20 eigenvalues
obtained from the distance matrix, for N dimensions. The
greater the value of the ratio, the higher the amount of the
original information that is conserved during this
transformation. The top three performing distance formulas
are

D1(i,j) = 1 / (S_{ij} - minS + offset)^2
D2(i,j) = 1 / \sqrt{(S_{ij} - minS + offset)^3}
D3(i,j) = 1 / \sqrt{(S_{ij} - minS)^2 + offset}

where S_{ij} is the similarity score between amino acids i and
j, and minS is the minimum score in the matrix. Since the
scores can be negative and distances have to be positive, we
subtracted the minimum score in the matrix from all the
scores in the matrix. The offset value is then added to avoid
division by zero.
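As a rough illustration, the three transformations can be sketched in Python with NumPy. This is a minimal sketch, not the authors' code: the function name `similarity_to_distance`, its `formula` argument, and the 3x3 toy score matrix are assumptions made for the example.

```python
import numpy as np

def similarity_to_distance(S, offset, formula=3):
    """Convert a similarity score matrix S into distances using D1, D2 or D3."""
    S = np.asarray(S, dtype=float)
    shifted = S - S.min()              # S_ij - minS, never negative
    if formula == 1:
        return 1.0 / (shifted + offset) ** 2
    if formula == 2:
        return 1.0 / np.sqrt((shifted + offset) ** 3)
    if formula == 3:
        return 1.0 / np.sqrt(shifted ** 2 + offset)
    raise ValueError("formula must be 1, 2 or 3")

# Toy 3x3 score matrix (hypothetical values, not taken from BLOSUM or PAM).
S_toy = np.array([[ 4, -1, -2],
                  [-1,  5,  0],
                  [-2,  0,  6]])
D_toy = similarity_to_distance(S_toy, offset=1.0, formula=3)
```

With a positive offset every denominator is strictly positive, so the lowest score in the matrix maps to the largest distance and higher scores map to smaller distances.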

To have a fair representation in a metric space, all
distances should satisfy the triangular inequality.

For any i, j and k

D_{ij} \le D_{ik} + D_{jk}

A linear mapping requires that this condition hold for
every triple of distance values. Since the offset value is the
only free parameter in the equations, we tried different
values for each scoring matrix until the triangular inequality
was satisfied.
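The offset search can be sketched as follows. This is an illustrative Python sketch under stated assumptions: the helper names (`count_triangle_violations`, `d1`, `first_valid_offset`) are hypothetical, equality is treated as satisfying the inequality, and a violation is counted once per unordered pair (i, j) and third symbol k, which may differ from how the authors counted triples.

```python
import numpy as np
from itertools import combinations

def count_triangle_violations(D, tol=1e-12):
    """Count cases where D_ij > D_ik + D_jk for distinct i, j, k."""
    D = np.asarray(D, dtype=float)
    n = D.shape[0]
    count = 0
    for i, j in combinations(range(n), 2):
        for k in range(n):
            if k != i and k != j and D[i, j] > D[i, k] + D[j, k] + tol:
                count += 1
    return count

def d1(S, offset):
    """Distance formula D1 applied to a score matrix S."""
    S = np.asarray(S, dtype=float)
    return 1.0 / (S - S.min() + offset) ** 2

def first_valid_offset(S, transform, offsets):
    """Return the first offset with no violations, or None if none works."""
    for offset in offsets:
        if count_triangle_violations(transform(S, offset)) == 0:
            return offset
    return None

# Toy score matrix (hypothetical values) and a scan over integer offsets.
S_toy = np.array([[ 4, -1, -2],
                  [-1,  5,  0],
                  [-2,  0,  6]])
best = first_valid_offset(S_toy, d1, range(1, 10))
```

Only off-diagonal entries enter the check, so the non-zero self-distances produced by the formulas do not affect the count.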

After the linear mapping of amino acids into a metric
space, we calculate the eigenvalues and the eigenvectors of
the distance matrix. To map the data symbols into
N-dimensional space, the eigenvectors of the greatest N
eigenvalues are used. The eigenvector of the greatest
eigenvalue, multiplied by that eigenvalue, gives the
x-coordinates of the symbols; the eigenvector of the second
greatest eigenvalue, multiplied by its eigenvalue, gives the
y-coordinates; and so on. The eigenvectors are the new
coordinate axes. Thus, we obtain the initial distance vectors
for the amino acids in the N-dimensional space defined by
those eigenvectors.
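The eigenvector mapping and the eigenvalue ratio used as the performance measure can be sketched in Python. The function name `eigen_embedding` and the toy matrix are assumptions for illustration. The ratio is computed as a plain sum of eigenvalues, as the text describes; if the matrix has negative eigenvalues this ratio can fall outside the range 0 to 1, and the text does not say how such cases were handled.

```python
import numpy as np

def eigen_embedding(D, n_dims=3):
    """Return coordinates from the leading eigenpairs of D and the eigenvalue ratio."""
    D = np.asarray(D, dtype=float)
    eigvals, eigvecs = np.linalg.eigh(D)          # D is assumed symmetric
    order = np.argsort(eigvals)[::-1]             # greatest eigenvalue first
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    # Each retained eigenvector, scaled by its eigenvalue, is one coordinate axis.
    coords = eigvecs[:, :n_dims] * eigvals[:n_dims]
    ratio = eigvals[:n_dims].sum() / eigvals.sum()
    return coords, ratio

# Toy symmetric matrix (hypothetical values) mapped into two dimensions.
D_toy = np.array([[2.0, 1.0, 0.0],
                  [1.0, 2.0, 1.0],
                  [0.0, 1.0, 2.0]])
coords, ratio = eigen_embedding(D_toy, n_dims=2)
```

Each row of `coords` is the vector for one symbol; the sign of each axis is arbitrary because eigenvectors are only defined up to sign.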

These initial distance vectors are then subjected to an
iterative refinement procedure. In this procedure, we first
calculate the distance L_{ij} between amino acids i and j in the
N-dimensional space. Next, we calculate the total error of
transforming into N dimensions as
\sum_i \sum_j (L_{ij} - D_{ij})^2, where D_{ij} is the
actual distance between amino acids i and j. The vectors are
updated so that the error is minimised. This procedure
continues until the vector coordinates converge for all the
amino acids.
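The text does not state the update rule, so the following Python sketch assumes plain gradient descent on the squared-error sum over all pairs i != j. The function names, the fixed learning rate `lr`, and the stopping tolerance are assumptions, and a larger problem such as 20 amino acids would likely need a smaller learning rate.

```python
import numpy as np

def pairwise_distances(X):
    """Euclidean distances L_ij between all rows of X."""
    diff = X[:, None, :] - X[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1))

def stress(X, D):
    """Sum over i != j of (L_ij - D_ij)^2."""
    off_diag = ~np.eye(len(X), dtype=bool)
    return ((pairwise_distances(X) - D)[off_diag] ** 2).sum()

def refine(X0, D, lr=0.01, n_iter=5000, tol=1e-10):
    """Move the vectors downhill on the stress until the updates become tiny."""
    X = np.array(X0, dtype=float)
    off_diag = ~np.eye(len(X), dtype=bool)
    for _ in range(n_iter):
        diff = X[:, None, :] - X[None, :, :]
        L = np.sqrt((diff ** 2).sum(axis=-1))
        coef = (L - D) / np.maximum(L, 1e-12)     # guard against zero distances
        coef[~off_diag] = 0.0                     # self-distances are ignored
        grad = 4.0 * (coef[:, :, None] * diff).sum(axis=1)
        step = lr * grad
        X -= step
        if np.abs(step).max() < tol:              # coordinates have converged
            break
    return X
```

The factor 4 in the gradient arises because each unordered pair appears twice in the double sum. Because the stress is not convex, the result depends on the starting vectors, which is why a good initial configuration matters.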

While transforming the similarity matrix into a distance
matrix, we lose some information. The similarity of each amino
acid to itself differs from one amino acid to another,
depending on the substitution frequencies observed over the
course of evolution. Ideally, on the other hand, the distance
of an amino acid to itself should be zero, but this is not
possible when the self-similarity scores differ.

We avoided this problem by using an encoding-decoding
technique. The main motive was the need for a nonlinear
method in which distances are not needed, so that the
similarity scores can be used directly. Second, a
transformation was desired that would compress the data in
such a way that the data could be recovered with minimum loss
and at minimum cost. The encoding-decoding method described
below satisfies both conditions: it is nonlinear, and its cost
is the time required to train the multilayer perceptron, which
contains 2, 3 or 4 hidden units depending on the number of
dimensions we choose to represent the symbols.

As seen in Fig. 1, the algorithm takes the similarity
scores and the dimension of the vector space as input
values. After initializing the weights and the learning rate,
the algorithm calculates the coordinates as the sum of the
products of the encoding weights and the similarity scores.
Output values are determined as the sum of the products of the
decoding weights and the coordinates. The error is the
difference between the input and the output values of the
amino acids. We then take partial derivatives of the error
function with respect to the encoding and decoding weights in
order to minimise the error function, and update the weights
and the learning rate factor. We continue this procedure until
the vector representation for each amino acid converges.
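The training loop described above can be sketched in Python. Several details are assumptions, since the text does not give them: the hidden units use plain weighted sums with no activation function (a nonlinear activation would be needed for the mapping to be nonlinear), the error is the sum of squared differences, and the learning rate is adapted with a simple accept/reject rule. The function name `encode_decode` and the toy matrix are hypothetical.

```python
import numpy as np

def encode_decode(S, n_dims=3, lr=1e-3, n_iter=5000, seed=0):
    """Train encoding/decoding weights; return coordinates and final error."""
    S = np.asarray(S, dtype=float)
    n = S.shape[0]
    rng = np.random.default_rng(seed)
    W_enc = rng.normal(scale=0.1, size=(n, n_dims))   # encoding weights
    W_dec = rng.normal(scale=0.1, size=(n_dims, n))   # decoding weights

    def error(We, Wd):
        return ((S @ We @ Wd - S) ** 2).sum()

    err = error(W_enc, W_dec)
    for _ in range(n_iter):
        Y = S @ W_enc                  # coordinates, one row per symbol
        R = Y @ W_dec - S              # output minus input
        grad_dec = 2.0 * Y.T @ R
        grad_enc = 2.0 * S.T @ R @ W_dec.T
        new_enc = W_enc - lr * grad_enc
        new_dec = W_dec - lr * grad_dec
        new_err = error(new_enc, new_dec)
        if new_err < err:              # accept the step and speed up
            W_enc, W_dec, err = new_enc, new_dec, new_err
            lr *= 1.05
        else:                          # reject the step and slow down
            lr *= 0.5
    return S @ W_enc, err

# Toy symmetric score matrix (hypothetical values) reduced to two dimensions.
S_toy = np.array([[ 5,  1, -2, -1],
                  [ 1,  4, -1, -2],
                  [-2, -1,  6,  2],
                  [-1, -2,  2,  5]])
coords, final_err = encode_decode(S_toy, n_dims=2)
```

Different values of `seed` give different starting weights, which corresponds to performing several runs when the procedure settles in a local minimum.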

III.  RESULTS 

There are several similarity score matrices that represent
different evolutionary relations and are obtained by using
different statistical measures. In this work, we used four
matrices: BLOSUM 45 [1], BLOSUM 60, PAM 250 and the
Dayhoff [4] matrix. We used several distance measures in
mapping the similarity scores into the metric space. The
mapping results obtained from the top three distance measures
are summarised in Tables 1-3. In these tables, (a) stands for
the offset value, (b) is the number of triples for which the
triangular inequality does not hold, and (c) is the ratio of
the sum of the greatest 3 eigenvalues to the sum of all
eigenvalues, given as a percentage.

The most successful distance measure for the
three-dimensional representation of amino acids was D3, since
it preserved the highest percentage of the similarity score
information (approximately 60%) for all the similarity score
matrices except BLOSUM45 (Table 3).


TABLE 1. Mapping with the formula D1 = 1 / (S_{ij} - minS + offset)^2

     Blosum45          Pam250           Blosum60          Dayhoff
   a    b    c       a    b    c       a    b    c       a    b    c
   8   20   52      16    8   57       7   14   58      10   76   60
   9    8   52      19    2   55       8    6   57      11   52   60
  10    6   51      20    0   54       9    0   57      16        58
  12    0   51      22    0   52      10    0   56      20    0   55

TABLE 2. Mapping with the formula D2 = 1 / \sqrt{(S_{ij} - minS + offset)^3}

     Blosum45          Pam250           Blosum60          Dayhoff
   a    b    c       a    b    c       a    b    c       a    b    c
   6    8   52       9   22   57       7   14   58      10   10   58
   7    4   52      11    8   57       8    6   57      12    2   56
   8    0   51      13    2   57       9    0   57      14    8   54
   9    0   51      14    0   56      10    0   56      15    0   53

TABLE 3. Mapping with the formula D3 = 1 / \sqrt{(S_{ij} - minS)^2 + offset}

     Blosum45          Pam250           Blosum60          Dayhoff
   a    b    c       a    b    c       a    b    c       a    b    c
  10    2   51      20    8   59       6    2   59      17   10   60
  14    0   52      22    0   59       7    0   58      22    0   59

Figure 1. Architecture of the encoding-decoding technique


If we map the amino acids into four-dimensional space, the
percentage of preserved information increases by about 7
percentage points across the board. The results obtained from
the four-dimensional representation of D3 are summarised in
Table 4. The best results are obtained from the Pam250 and
Dayhoff matrices, even though the difference is marginal.

The results obtained from Pam 250 are shown in Fig. 2.
We see a cluster of hydrophobic residues, and within this
cluster we see an additional cluster of aromatic residues.
Charged and polar amino acids also form a cluster. The third
cluster is formed by small aliphatic amino acids. In
evolutionary data we observe accepted mutations within these
clusters. As can be seen in Fig. 2, the amino acid cysteine
stands alone. This is expected, since cysteine is the only
amino acid that can form a disulfide bond, which is one of the
most important stabilising factors for protein structure.
Cysteine residues therefore tend not to be substituted by
other amino acids, since they are functional sites of
proteins.

As expected, increasing the dimensionality decreases the
amount of lost information. The sum of the information
content of the other dimensions is only 23% of the total.
Going into the 5th dimension improves the result by a few
percentage points in each case. There is always some loss of
information in going from similarity to distance, and
reducing the dimensionality introduces additional loss.

To overcome this problem we use the encoding-decoding
technique. This technique performs a direct mapping from
similarity scores into N-dimensional space. Since this method
converges to a local minimum, we perform several runs. The
Blosum 60 similarity score matrix gave the best clustering of
the amino acids in all the runs. Even in two dimensions,
Blosum 60 forms the expected clusters that are intrinsic to
evolution (Fig. 3). However, we cannot observe the distinction
of cysteine from the other amino acids in two dimensions.
Adding the third dimension separates cysteine from the other
clusters (Fig. 4).


TABLE 4. 4-Dimensional values

             Offset Value   Triangular Inequality   Eigenvalues Info
Blosum45          14                  0                   59.2
Pam250            22                  0                   66.3
Blosum60           7                  0                   64.7
Dayhoff           22                  0                   66.9

Figure 3. Blosum 60 using encoding decoding

IV.  Discussion 

In this work we studied three methods to decrease the
dimensionality of the similarity score matrices. Linear mapping
always gave consistent results, but information was lost while
converting the similarity to distance. The iterative distance
refinement approach starts from a different initial condition,
so a better mapping could be the result of a better search of
the solution space. However, this approach has a convergence
problem: for some of the matrices the amino acid position
vectors did not converge satisfactorily.

The encoding-decoding approach gave the best results because
of the direct transformation from similarity to metric space.
This method also starts from several initial states, so it
searches the space more efficiently; however, like the
distance refinement approach, it falls into local minima and
has problems of convergence. It did, however, converge faster
than the distance refinement approach.


Figure  2.  Blosum  60  with  distance  refinement 


Figure  4.  Blosum60  using  encoding  decoding  in  3-D 


V.  Conclusion 
served, the clusters represented the actual groupings of the
amino acids represented in the score matrices.